1 Background

After the variant call process the VCF file generated contains some artefacts that affect downstream analysis. Alignment artifacts can occur whenever there is sufficient sequence similarity between two or more regions in the genome to confuse the alignment algorithm. This can occur when the aligner for whatever reason overestimate how uniquely a read maps, thereby assigning it too high of a mapping quality. It can also occur through no fault of the aligner due to gaps in the reference, which can also hide the true position to which a read should map. Even thought GATK solve most of these artifacts some still persists. In this tutorial we will filter out artifacts from our variant calls and generate a ‘cleaner’ set of loci for downstream analysis.

2 Methods

2.1 Section 1: Filtering of VCF files

2.1.1 Setp 1: Filter of non core regions, coding regions, and samples of interest.

For this step, the .vcf file is still too large to be manipulated in our personal computer, for that reason this step had been P. vivax has several genomic regions like the vir genes that are hard to align because they

For this first step we require the following files:

  • .vcf.gz or .vcf files: If a .vcf.gz is used, a .vcf.tbi will be also required. For this tutorial the .vfc.gz file that we are going to work with is: ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz.

  • .gff: the gff file of the reference genome will be also required in order to filter coding regions.

  • noncore_regions.bed: This file will be used to remove non-core regions that are difficult to sequence from the analysis.

  • load_libraries.R: This R script file load all the required packages that we need to perform the analysis.

  • functions.R: This R script file load all the functions that we have created to perform the analysis. Within those functions we a function named fx_run_vcftools that allow us to manipulate the vcftools package from the R environment. We have also created other functions to get information from the .vcf file with loading the whole file in R environment. This functions will help us the get information regarding to the name of the samples and create a samples.indv to use as a filter in the analysis. Other functions will be explained later in this tutorial.

  • Filtering_vcf_file.R: This R script file contains the instructions to perform the analysis in R. This script requires 9 arguments:

    • wd: Working directory where all temporal files and final outputs will be stored.

    • fd: Directory where the pre-required files are stored.

    • gzvcf: Path to the .vcf.gz file.

    • o: prefix used for the temporal and final outputs.

    • rkeep: regular expression to identify samples of interest.

    • ebed: name of the noncore_regions.bed file. This file should be stored in the fd path.

    • gff: name of the .gff file. This file should be stored in the fd path.

    • n: number of iterations. In the final step of this script we are going to transform our data from VCF format to rGenome format. This step is computing exhausting and consumes RAM memory. For that reason we subdivide this step in n instances to be able to run in the server.

    • thres: Minimum read depth to call an allele in a sample.

  • r_options.sh: bash script with the instructions to run R. This file should be stored in wd. The scrip should looks as follow:

##!/bin/bash
# source /broad/software/scripts/useuse
# use R-4.1
# Rscript /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools/Filtering_vcf_file.R \
#   -wd /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/ \
#   -fd /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools \
#   -gzvcf /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz \
#   -o ColombiaGates \
#   -rkeep (Col|SP) \
#   -ebed Pvivax_P01_noncore.bed \
#   -gff genes.gff \
#   -n 500 \
#   -thres 5
  • uger_options.sh: bash script with the instruction to run UGER. This file should be stored in wd. The scrip should looks as follow:
##!/bin/bash
#qsub -l h_vmem=32G \
#   -l h_rt=01:00:00 \
#   -o /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/output/ \
#   /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/r_options.sh

As a final output this step will generate and .RData file containing the data in rGenome format.

2.1.2 Step 2: Filter of potential PCR genotyping artifacts (Missing data, Homopolymers & Short Tandem Repeats)

  1. Load the rGenome Object and required libraries and functions:
setwd('~/Documents/Github/Plasmodium_WGS_analysis/Pre_filtering/')

source('../functions_libraries/load_libraries.R')

for(file in list.files('PerVen/Pfal/', pattern = '.RData')){
  load(file.path('PerVen/Pfal/', file))
  rm(list = ls()[!grepl('Pf3D7_(\\d+|MIT|API)_v3_rGenome_object', ls())])
}

rGenome_objects = list(Pf3D7_01_v3_rGenome_object,
     Pf3D7_02_v3_rGenome_object,
     Pf3D7_03_v3_rGenome_object,
     Pf3D7_04_v3_rGenome_object,
     Pf3D7_05_v3_rGenome_object,
     Pf3D7_06_v3_rGenome_object,
     Pf3D7_07_v3_rGenome_object,
     Pf3D7_08_v3_rGenome_object,
     Pf3D7_09_v3_rGenome_object,
     Pf3D7_10_v3_rGenome_object,
     Pf3D7_11_v3_rGenome_object,
     Pf3D7_12_v3_rGenome_object,
     Pf3D7_13_v3_rGenome_object,
     Pf3D7_14_v3_rGenome_object)

rm(list = c('Pf3D7_01_v3_rGenome_object',
     'Pf3D7_02_v3_rGenome_object',
     'Pf3D7_03_v3_rGenome_object',
     'Pf3D7_04_v3_rGenome_object',
     'Pf3D7_05_v3_rGenome_object',
     'Pf3D7_06_v3_rGenome_object',
     'Pf3D7_07_v3_rGenome_object',
     'Pf3D7_08_v3_rGenome_object',
     'Pf3D7_09_v3_rGenome_object',
     'Pf3D7_10_v3_rGenome_object',
     'Pf3D7_11_v3_rGenome_object',
     'Pf3D7_12_v3_rGenome_object',
     'Pf3D7_13_v3_rGenome_object',
     'Pf3D7_14_v3_rGenome_object'))

source('../functions_libraries/functions.R')
#sourceCpp('../functions_libraries/Rcpp_functions.cpp')

Pfal_rGenome_object = rGenome()

for(obj in 1:length(rGenome_objects)){
  Pfal_rGenome_object@data$gt = rbind(Pfal_rGenome_object@data$gt, rGenome_objects[[obj]]@data$gt)
  Pfal_rGenome_object@data$loci_table = rbind(Pfal_rGenome_object@data$loci_table, rGenome_objects[[obj]]@data$loci_table)
}

Pfal_rGenome_object@data$metadata = rbind(Pfal_rGenome_object@data$metadata, rGenome_objects[[obj]]@data$metadata)

rm(list = c('rGenome_objects', 'obj'))
  1. Add Metadata.
# There are some spelling mistakes in the sample codes of the sequencing files, The table VENPER2019_2022.csv match the incorrect codes with the correct codes
external_metadata = read.csv('../GeneticDiversity/merged_metadata.csv')
# PerVen_seq_codes = read.csv('PerVen/VENPER2019_2022.csv')

# Remove Duplicated records
external_metadata = external_metadata[!duplicated(external_metadata$Sample_id),]
# PerVen_seq_codes = PerVen_seq_codes[!duplicated(PerVen_seq_codes$PreferedSampleID),]

# merge incorrect and correcte codes with metadata
# external_metadata = merge(external_metadata, PerVen_seq_codes, by.x = 'Sample_id', by.y = 'AlternateSampleID', all.x = T)

# external_metadata %<>% mutate(PreferedSampleID = case_when(
#   is.na(PreferedSampleID) ~ Sample_id,
#   !is.na(PreferedSampleID) ~ PreferedSampleID
# ))

# Add metadata to the rGenome object
Pfal_rGenome_object@data$metadata = merge(Pfal_rGenome_object@data$metadata, external_metadata[,c("Sample_id", "Study", "Country", 'Date_of_Collection', "Year_of_Collection",'Subnational_level0', 'Subnational_level1', "Subnational_level2")], by.x = 'Sample_id', by.y = 'Sample_id', all.x = T)

# Sorting samples
rownames(Pfal_rGenome_object@data$metadata) =
  Pfal_rGenome_object@data$metadata$Sample_id

Pfal_rGenome_object@data$metadata =
  Pfal_rGenome_object@data$metadata[colnames(Pfal_rGenome_object@data$gt),]

2.1.2.1 Step 2.1: Missing data

  1. Remove samples with low amplification rate.

3.1 Sample amplification rate

Pfal_rGenome_object = SampleAmplRate(Pfal_rGenome_object)

3.2 Check the distribution of the amplification rate of the samples using a histogram

Pfal_rGenome_object@data$metadata %>% ggplot(aes(x = SampleAmplRate, fill = Country))+
  geom_histogram(position = 'stack', binwidth = .05, color = 'gray40')+
  theme_bw()+
  labs(title = 'Sample Amplification Rate distribution',
       x = 'Amplification Rate')

3.3 Remove samples with less than 75% of amplified loci (> 25% of missing data)

Pfal_rGenome_object = filter_samples(obj = Pfal_rGenome_object, v =
                Pfal_rGenome_object@data$metadata$SampleAmplRate >= .75)
  1. Remove monomorphic sites

4.1 Update allele counts

Pfal_rGenome_object@data$loci_table = cbind(Pfal_rGenome_object@data$loci_table,
                                       get_AC(obj = Pfal_rGenome_object))

4.2 Filter monomorphic sites

Pfal_rGenome_object = filter_loci(Pfal_rGenome_object,
                             v = Pfal_rGenome_object@data$loci_table$Cardinality > 1)
  1. Remove loci with high missing data

5.1 Calculate the amplification rate of each loci

Pfal_rGenome_object = LocusAmplRate(Pfal_rGenome_object)

5.2 Check the distribution of the amplification rate of the samples using a histogram

Pfal_rGenome_object@data$loci_table %>%
    mutate(CHROM2  = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
    ggplot(aes(x = POS, y = LocusAmplRate)) +
    geom_point(alpha = 0.7, size = .25) +
    facet_grid(.~ CHROM2)+
    theme_bw()+
    labs(y = 'Ampl. Rate', x = 'Chromosomal position')+
    theme(legend.position = 'right',
          axis.text.x = element_blank())

5.3 Filter loci with amplification rate below .75

Pfal_rGenome_object = filter_loci(Pfal_rGenome_object, v = Pfal_rGenome_object@data$loci_table$LocusAmplRate >= .75)
  1. Update alternative alleles
Pfal_rGenome_object@data$loci_table$ALT =
  ifelse(Pfal_rGenome_object@data$loci_table$REF !=
           gsub(',([ATCG]|\\*)+',
                '',
                gsub(':\\d+',
                     '',
                     Pfal_rGenome_object@data$loci_table$Alleles)),
         gsub(':\\d+',
              '',
              Pfal_rGenome_object@data$loci_table$Alleles),
         gsub('^([ATCG]|\\*)+,',
              '',
              gsub(':\\d+', '', Pfal_rGenome_object@data$loci_table$Alleles))
       )

2.1.2.2 Step 2.2: Homopolymers and Short tandem repeats

  1. Differentiate between SNPs, INDELS, Homopolymers and Short tandem repeats
Pfal_rGenome_object@data$loci_table$TypeOf_Markers =
  TypeOf_Marker(Pfal_rGenome_object, w = 1, n = 1)
2.1.2.2.1 Effect of PCR genotyping artifacts on heterozygosity
  1. Get Proportion of Heterozygous samples per site
Pfal_rGenome_object@data$loci_table$ObsHet = get_ObsHet(Pfal_rGenome_object, by = 'loci', w = 1, n = 1)
  1. Calculate fraction of Heterozygous samples per Alternative alleles per site
Pfal_rGenome_object@data$loci_table$frac_ofHet_pAlts =
  frac_ofHet_pAlt(Pfal_rGenome_object, w = 1, n = 1)
  1. Classify sites based on the proportion of alternative alleles in polyclonal samples per site

10.1 Distribution of the fraction of alternative alleles in polyclonal samples

Pfal_rGenome_object@data$loci_table %>% ggplot(aes(x = frac_ofHet_pAlts))+
    geom_histogram(binwidth = 0.01)+
    labs(x = 'Fraction of heterozygous samples per alternative allele per site',
         y = 'Number of sites (Loci)')+
    theme_bw()

10.2 Classify sites based on the proportion of alternative alleles in polyclonal samples per site

Pfal_rGenome_object@data$loci_table %<>% mutate(ALT_FILTER =case_when(
    frac_ofHet_pAlts == 1 ~ '100%',
    frac_ofHet_pAlts < 1 & frac_ofHet_pAlts > .5 ~ '50 - 99%',
    frac_ofHet_pAlts <= .5 ~ '<=50%',
  ))

10.3 Distribution of observed heterozygosity per locus, by type of marker, by group (defined in previuos step 10.2)

Pfal_rGenome_object@data$loci_table %>%
    ggplot(aes(x = ObsHet,
               fill = factor(ALT_FILTER,
                             levels = c('<=50%', '50 - 99%', '100%'))))+
    geom_histogram(binwidth = .01)+
    scale_fill_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
    facet_wrap(.~factor(TypeOf_Markers, levels = c(
      'SNP',
      'INDEL',
      'INDEL:Homopolymer',
      'INDEL:Dinucleotide_STR',
      'INDEL:Trinucleotide_STR',
      'INDEL:Tetranucleotide_STR',
      'INDEL:Pentanucleotide_STR',
      'INDEL:Hexanucleotide_STR'
    )), scales = 'free_y', ncol = 4)+
    labs(y = 'Number of Loci',
         x = 'Observed Heterozygosity per locus',
         fill = 'Het/Alt')+
    theme_bw()

  1. Remove Homopolymers and DiSTRs
Pfal_rGenome_object = filter_loci(Pfal_rGenome_object,
            v = !(Pfal_rGenome_object@data$loci_table$TypeOf_Markers %in%
                    c('INDEL:Homopolymer', 'INDEL:Dinucleotide_STR')))

2.1.3 Step 3: Filter of problematic genomic regions to map reads

2.1.3.1 Regions with excess of Heterozygosity

  1. Distribution of proportion of heterozygous per site by sites with <= 50% Het/Alt, 50 - 99% Het/Alt and 100% of Het/Alt
  # In that way we identify that maximum fraction of Heterozygous samples per loci is 0.4855491
  
Pfal_rGenome_object@data$loci_table %>%
    mutate(TypeOf_Markers2 = case_when(
      TypeOf_Markers == 'SNP' ~ 'SNP',
      TypeOf_Markers != 'SNP' ~ 'INDELs'),
      CHROM2  = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
    ggplot(aes(x = POS, y = ObsHet, color = factor(ALT_FILTER, levels = c('<=50%', '50 - 99%', '100%')))) +
    geom_point(alpha = 0.5, size = .25) +
    scale_color_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
    facet_grid(TypeOf_Markers2 ~ CHROM2)+
    theme_bw()+
    labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Het/Alt')+
    theme(legend.position = 'right',
          axis.text.x = element_blank(),
          axis.title.x = element_blank())

  1. Calculate the average of heterozygous samples per site per gene
Pfal_rGenome_object@data$loci_table = mean_ObsHet(Pfal_rGenome_object, gff = '../reference/PlasmoDB-59_Pfalciparum3D7.gff')
  1. Define a threshold for trusted variants
ObsHet_Threshold = quantile(Pfal_rGenome_object@data$loci_table %>%
                                filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
                                group_by(gene_id) %>%
                                dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
                                select(mean_ObsHet) %>%
                                unlist, .95)
  1. Visualize the Average Observed Heteozygosity in SNPs per genomic position
Pfal_rGenome_object@data$loci_table %>%
    filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
    group_by(gene_id) %>%
    dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
    ggplot(aes(x = mean_ObsHet))+
    geom_histogram(binwidth = .005)+
    geom_vline(xintercept = ObsHet_Threshold)+
    labs(y = 'Number of genomic regions', x = 'Average Observed Heteozygosity in SNPs') +
    theme_bw()

  1. Differentiate genomic regions that PASS the threshold
Pfal_rGenome_object@data$loci_table %<>% mutate(ObsHet_Filter = case_when(
    mean_ObsHet > ObsHet_Threshold ~ 'OUT',
    mean_ObsHet <= ObsHet_Threshold ~ 'PASS'
  ))
  1. Visualize which genomic regions PASS the filter
Pfal_rGenome_object@data$loci_table %>%
    mutate(TypeOf_Markers2 = case_when(
      TypeOf_Markers == 'SNP' ~ 'SNP',
      TypeOf_Markers != 'SNP' ~ 'INDEL'),
      CHROM2  = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
    ggplot(aes(x = POS, y = ObsHet, color = ObsHet_Filter)) +
    geom_point(alpha = 0.5, size = .25) +
    geom_hline(yintercept = ObsHet_Threshold)+
    scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
    facet_grid(TypeOf_Markers2 ~ CHROM2)+
    theme_bw()+
    labs(y = 'Observed Heterozygosity',
         x = 'Chromosomal position',
         color = 'Mean ObsHet > Th')+
    theme(legend.position = 'right',
          axis.text.x = element_blank())

2.1.3.2 Regions with excess of density of Polymorphism

  1. Calculate the density of SNPs per genomic region and define a threshold
Pfal_rGenome_object@data$loci_table = SNP_density(Pfal_rGenome_object, gff = '../reference/PlasmoDB-59_Pfalciparum3D7.gff')

SNP_density_threshold = quantile(Pfal_rGenome_object@data$loci_table %>%
                                     filter(TypeOf_Markers == 'SNP') %>%
                                     group_by(gene_id) %>%
                                     dplyr::summarise(SNP_density = max(SNP_density)) %>%
                                     select(SNP_density) %>%
                                     unlist, .95)
  1. Distribution of SNP density
Pfal_rGenome_object@data$loci_table %>%
    filter(TypeOf_Markers == 'SNP') %>%
    group_by(gene_id) %>%
    dplyr::summarise(SNP_density = max(SNP_density)) %>%
    ggplot(aes(x = SNP_density))+
    geom_histogram(binwidth = .001)+
    geom_vline(xintercept = SNP_density_threshold)+
    labs(y = 'Number of genomic regions', x = 'SNP density') +
    theme_bw()

  1. Differentiate/Filter genomic regions that PASS the threshold
Pfal_rGenome_object@data$loci_table %<>% 
  mutate(SNP_density_Filter = case_when(
    SNP_density > SNP_density_threshold ~ 'OUT',
    SNP_density <= SNP_density_threshold ~ 'PASS'
  ))
  1. Visualize which genomic regions PASS the filter
Pfal_rGenome_object@data$loci_table %>%
    mutate(TypeOf_Markers2 = case_when(
      TypeOf_Markers == 'SNP' ~ 'SNP',
      TypeOf_Markers != 'SNP' ~ 'INDEL'),
      CHROM2  = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
    ggplot(aes(x = POS, y = ObsHet, color = SNP_density_Filter)) +
    geom_point(alpha = 0.5, size = .25) +
    scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
    facet_grid(TypeOf_Markers2 ~ CHROM2)+
    theme_bw()+
    labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'SNP density')+
    theme(legend.position = 'right',
          axis.text.x = element_blank())

  1. Differentiate of problematic genomic regions to map reads
Pfal_rGenome_object@data$loci_table %<>%
  mutate(Alignment_Filter = case_when(
    ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT' ~ 'OUT',
    !(ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT') ~ 'PASS'
  ))
  1. Visualize which genomic regions PASS the filter
Pfal_rGenome_object@data$loci_table %>%
    mutate(TypeOf_Markers2 = case_when(
      TypeOf_Markers == 'SNP' ~ 'SNP',
      TypeOf_Markers != 'SNP' ~ 'INDEL'),
      CHROM2  = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
    ggplot(aes(x = POS, y = ObsHet, color = Alignment_Filter)) +
    geom_point(alpha = 0.5, size = .25) +
    scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
    facet_grid(TypeOf_Markers2 ~ CHROM2)+
    theme_bw()+
    labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Alignment\nFilter')+
    theme(legend.position = 'right',
          axis.text.x = element_blank())

  1. Check which genomic regions or genes are removed
Pfal_rGenome_object@data$loci_table %>% 
  filter(Alignment_Filter == 'OUT') %>% 
  select(gene_id) %>% 
  unique()
##                           gene_id
## Pf3D7_01_v3_542041  PF3D7_0114000
## Pf3D7_02_v3_863081  PF3D7_0221500
## Pf3D7_03_v3_259963  PF3D7_0305400
## Pf3D7_03_v3_1007410 PF3D7_0324100
## Pf3D7_04_v3_38961   PF3D7_0400200
## Pf3D7_04_v3_63131   PF3D7_0400700
## Pf3D7_04_v3_89170   PF3D7_0401500
## Pf3D7_04_v3_156081  PF3D7_0402500
## Pf3D7_04_v3_1136672 PF3D7_0425200
## Pf3D7_05_v3_33564   PF3D7_0500400
## Pf3D7_06_v3_26967   PF3D7_0600500
## Pf3D7_06_v3_722448  PF3D7_0617400
## Pf3D7_07_v3_70826   PF3D7_0701600
## Pf3D7_07_v3_525381  PF3D7_0711900
## Pf3D7_07_v3_562933  PF3D7_0712500
## Pf3D7_07_v3_566727  PF3D7_0712600
## Pf3D7_08_v3_49758   PF3D7_0800300
## Pf3D7_08_v3_1420832 PF3D7_0833100
## Pf3D7_09_v3_64259   PF3D7_0901200
## Pf3D7_09_v3_1010517 PF3D7_0924900
## Pf3D7_11_v3_52379   PF3D7_1100500
## Pf3D7_12_v3_44423   PF3D7_1200500
## Pf3D7_12_v3_767122  PF3D7_1219300
## Pf3D7_12_v3_1715854 PF3D7_1240500
## Pf3D7_14_v3_30103   PF3D7_1400900
## Pf3D7_14_v3_1375581 PF3D7_1434400
  1. Remove problematic genomic regions to map reads
Pfal_rGenome_object = 
  filter_loci(Pfal_rGenome_object,
              v = Pfal_rGenome_object@data$loci_table$Alignment_Filter == 'PASS')

3 Section 2: Differentiating monoclonal and polyclonal infections

  1. Define the fraction of heterozygous loci per sample
Pfal_rGenome_object@data$metadata$fracHet = 
  get_ObsHet(obj = Pfal_rGenome_object, by = 'sample')
  1. Within host divergence (Fws) per sample
Pfal_rGenome_object@data$metadata$Fws = 
  get_Fws(obj = Pfal_rGenome_object, w = 1, n = 1)
  1. Visualize distribution of Heterozygosity and Fws
Pfal_rGenome_object@data$metadata %>%
  ggplot(aes(x = fracHet, y = Fws))+
  geom_point()+
  geom_hline(yintercept = .97)+
  geom_hline(yintercept = .88)+
  geom_vline(xintercept = .015)+
  geom_vline(xintercept = .06)+
  facet_grid(.~Country)+
  labs(x = 'Fraction of heterozygous loci')+
  theme_bw()

  1. Defining Monoclonal, Highly related polyclonal, and Polyclonal infections
Pfal_rGenome_object@data$metadata %<>% 
  mutate(Clonality = case_when(
    Fws >= .97  ~ 'Monoclonal',
    Fws < .97 & Fws >= .88 ~ 'Highly related Polyclonal',
    Fws < .88 ~ 'Polyclonal'
  ))
  1. Save data
rm(list = ls()[-grep('Pfal_rGenome_object',ls())])

save.image('../GeneticDiversity/PerVen_Pviv_filtered_rGenome.RData')